Bouïra Province
CAFE A Novel Code switching Dataset for Algerian Dialect French and English
Lachemat, Houssam Eddine-Othman, Abbas, Akli, Oukas, Nourredine, Kheir, Yassine El, Haboussi, Samia, Shammur, Absar Chowdhury
The paper introduces and publicly releases (data download link available after acceptance) CAFE, the first code-switching dataset covering the Algerian dialect, French, and English. The CAFE speech data is unique in that (a) it captures a spontaneous speaking style in in-vivo human-human conversation, including phenomena such as code-switching and overlapping speech; (b) it addresses distinct linguistic challenges of the North African Arabic dialect; and (c) it captures dialectal variation from various parts of Algeria within different sociolinguistic contexts. CAFE contains approximately 37 hours of speech. A subset, CAFE-small, of 2 hours and 36 minutes is released with manual human annotation, including speech segmentation, transcription, explicit annotation of code-switching points, overlapping speech, and other events such as noise and laughter. The remaining approximately 34.58 hours carry pseudo-label transcriptions. In addition to the data release, the paper highlights the challenges that state-of-the-art Automatic Speech Recognition (ASR) models such as Whisper large-v2, Whisper large-v3, and PromptingWhisper face with such content. We then benchmark CAFE with these Whisper models and show how well-designed data processing pipelines and advanced decoding techniques can improve ASR performance, reaching a Mixed Error Rate (MER) of 0.310, a Character Error Rate (CER) of 0.329, and a Word Error Rate (WER) of 0.538.
- Africa > Middle East > Algeria > Bouïra Province > Bouira (0.06)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Europe > Germany > Berlin (0.04)
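The WER and CER figures quoted in the CAFE abstract are edit-distance ratios. The following is a minimal sketch of how such scores are commonly computed, not the authors' evaluation script: the function names are hypothetical, the CER variant here ignores spaces (conventions differ), and MER is left out because the abstract does not specify its tokenization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate, with spaces removed (an assumption of this sketch)."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)
```

Because insertions count as errors, both ratios can exceed 1.0 when the hypothesis is much longer than the reference.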
Enhancing Contextual Understanding in Large Language Models through Contrastive Decoding
Zhao, Zheng, Monti, Emilio, Lehmann, Jens, Assem, Haytham
Large language models (LLMs) tend to inadequately integrate input context during text generation, relying excessively on encoded prior knowledge in model parameters, potentially resulting in generated text with factual inconsistencies or contextually unfaithful content. LLMs utilize two primary knowledge sources: 1) prior (parametric) knowledge from pretraining, and 2) contextual (non-parametric) knowledge from input prompts. The study addresses the open question of how LLMs effectively balance these knowledge sources during the generation process, specifically in the context of open-domain question answering. To address this issue, we introduce a novel approach integrating contrastive decoding with adversarial irrelevant passages as negative samples to enhance robust context grounding during generation. Notably, our method operates at inference time without requiring further training. We conduct comprehensive experiments to demonstrate its applicability and effectiveness, providing empirical evidence showcasing its superiority over existing methodologies. Our code is publicly available at: https://github.com/amazon-science/ContextualUnderstanding-ContrastiveDecoding.
- Asia > China > Jiangsu Province > Nanjing (0.05)
- Asia > Taiwan > Taiwan > Taipei (0.05)
- North America > Canada > Ontario > Toronto (0.05)
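The idea of contrasting a relevant context against an irrelevant one at inference time can be illustrated on a toy vocabulary. This is a generic contrastive-decoding sketch, not the exact formulation from the paper: the scoring rule `(1 + alpha) * log p(relevant) - alpha * log p(irrelevant)`, the `alpha` parameter, and all function names are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(logits_relevant, logits_irrelevant, alpha=0.5):
    """Boost tokens favored with the relevant passage, penalize tokens
    that stay likely even when the passage is irrelevant (assumed rule)."""
    if len(logits_relevant) != len(logits_irrelevant):
        raise ValueError("both logit vectors must cover the same vocabulary")
    pos = log_softmax(logits_relevant)
    neg = log_softmax(logits_irrelevant)
    return [(1 + alpha) * p - alpha * n for p, n in zip(pos, neg)]


def pick_token(logits_relevant, logits_irrelevant, alpha=0.5):
    """Greedy choice of the next token index under the contrastive score."""
    scores = contrastive_scores(logits_relevant, logits_irrelevant, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
with_relevant_passage = [2.0, 1.9, 0.0]
with_irrelevant_passage = [2.5, 0.5, 0.0]
```

With `alpha=0.0` the rule reduces to ordinary greedy decoding on the relevant-context logits; raising `alpha` shifts the choice toward tokens whose probability depends on the relevant passage rather than on the model's prior.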
NADI 2020: The First Nuanced Arabic Dialect Identification Shared Task
Abdul-Mageed, Muhammad, Zhang, Chiyu, Bouamor, Houda, Habash, Nizar
We present the results and findings of the First Nuanced Arabic Dialect Identification Shared Task (NADI). This Shared Task includes two subtasks: country-level dialect identification (Subtask 1) and province-level sub-dialect identification (Subtask 2). The data for the shared task covers a total of 100 provinces from 21 Arab countries and is collected from the Twitter domain. As such, NADI is the first shared task to target naturally occurring fine-grained dialectal text at the sub-country level. A total of 61 teams from 25 countries registered to participate in the tasks, reflecting the community's interest in this area. We received 47 submissions for Subtask 1 from 18 teams and 9 submissions for Subtask 2 from 9 teams.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Africa > Middle East > Djibouti (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)